Contents

1 Introduction

This vignette show how to MOFA can be used to disentangle patient heterogeneity in a multi-omic data set on chronic lymphocytic leukemia (CLL). It contains all major analysis steps and findings presented in Argelaguet, Velten et al. (2018) on the CLL data.

library(MOFAtools)
library(dplyr)
library(reshape2)
library(gridExtra)
library(ggplot2)
library(GGally)
library(magrittr)
library(cowplot)
library(beeswarm)
library(ggpubr)
library(ClusterR)
library(pheatmap)
library(ggbeeswarm)
library(RColorBrewer)
library(tidyr)
library(glmnet)
library(survival)
library(Hmisc)
library(BloodCancerMultiOmics2017)
source("plotting_utils.R")
options(stringsAsFactors = FALSE)

2 Step 1: Load data and create MOFA object

Here we load the pre-processed CLL data which is part of the MOFAtools pacakges. The original raw data can be obtained from Dietrich et al, 2018 and are available at the European Genome-phenome Archive under accession EGAS00001001746 and data tables as R objects can be downloaded from http://pace.embl.de/ and the R package BloodCancerOmics2017. Briefly, we included the following data for the training of MOFA: Mutations (>= 3 occurences), Methylation (top 1%, no sex chromosomes), mRNA (top 5000 most variable genes excludign the Y chromosome) and drug responses for all samples that have data in at least two views (N=200). The scripts for pre-processing the data starting from the original study can be found here.

# Load data: list containing matrices for mRNA, Methylation, Drug Response and Mutation data
data("CLL_data")
# samples are columns, features are rows
sapply(CLL_data, dim)
##      Drugs Methylation mRNA Mutations
## [1,]   310        4248 5000        69
## [2,]   200         200  200       200

There are two options to input data to MOFA:

2.1 Option 1: Create MOFA from a list of matrices

If using the base R approach, you simply need to provide a list of matrices where features are rows and samples are columns. Importantly, the samples need to be aligned. Missing values/assays should be filled with NAs.

MOFAobject <- createMOFAobject(CLL_data)
## Creating MOFA object from list of matrices, please make sure that samples are columns and features are rows...
## Untrained MOFA model with the following characteristics: 
##  Number of views: 4 
##  View names: Drugs Methylation mRNA Mutations 
##  Number of features per view: 310 4248 5000 69 
##  Number of samples: 200
MOFAobject
## Untrained MOFA model with the following characteristics: 
##  Number of views: 4 
##  View names: Drugs Methylation mRNA Mutations 
##  Number of features per view: 310 4248 5000 69 
##  Number of samples: 200

2.2 Option 2: Create MOFA from a MultiAssayExperiment

If using the Bioconductor approach, you need to provide or create a MultiAssayExperiment object and then use it to build the MOFA object. For example, starting from a list of matrices where features are rows and samples are columns, this can be easily constructed as follows:

library(MultiAssayExperiment)

 # Load sample metadata (colData): Sex and Diagnosis
data("CLL_covariates")
head(CLL_covariates)
##      Gender Diagnosis
## H045      m       CLL
## H109      m       CLL
## H024      m       CLL
## H056      m       CLL
## H079      m       CLL
## H164      f       CLL
# Create MultiAssayExperiment object 
mae_CLL <- MultiAssayExperiment(experiments = CLL_data,
                                colData = CLL_covariates)

# Build the MOFA object
MOFAobject <- createMOFAobject(mae_CLL)
## Creating MOFA object from a MultiAssayExperiment object...
## Untrained MOFA model with the following characteristics: 
##  Number of views: 4 
##  View names: Drugs Methylation mRNA Mutations 
##  Number of features per view: 310 4248 5000 69 
##  Number of samples: 200

2.3 Overview of training data

The function plotTilesData can be used to obtain an overview of the data stored in the object for training (as in Figure 2a). For each sample it shows in which views data are available.

myPalette <- c(RColorBrewer::brewer.pal(8,"Dark2"),RColorBrewer::brewer.pal(9,"Set1"), RColorBrewer::brewer.pal(3,"Greens")[3])
cols4omics <- myPalette[c(6,10,18,2)]
plotTilesData(MOFAobject, colors=cols4omics)

3 Step 2: Training of MOFA and model selection

Once an untrained MOFAobject has been created, the next step is model training. This part of the pipeline is implemented in Python, so first of all make sure you have the corresponding package installed (see installation instructions in the README file). The model can either be trained directly from the command line (see running instructions, it is very simple) or using following R wrapper.

3.1 Model training

3.1.1 Specify in- and output paths

In DirOptions places can be specified for saving the training data as .txt files (as required for the model training in Python) and the resulting fitted models as .hdf5 files. This can also be set to temporary directories as below.

DirOptions <- list(
  "dataDir" = tempdir(), # Folder to store the input matrices as .txt files, it can be a simple temporary folder
  "outFile" = tempfile() # Output file of the model (use .hdf5 extension)
)

3.1.2 Define model options

The most important options the users can define are:

  • numFactors: initial number of factors (default of 25)
  • likelihoods: likelihood for each view, gaussian for continuous data, bernoulli for binary data and poisson for count data. If not provided, the model tries to guess it from the data.
  • sparsity: Are sparse factor loadings desired (default YES)
  • learnIntercept: Should an intercept be learnt in the model (default TRUE). Setting this to FALSE is usually only recommended in the case of centered gaussian data.
ModelOptions <- getDefaultModelOptions(MOFAobject)
## Likelihoods guessed automatically: gaussian gaussian gaussian bernoulli
ModelOptions
## $likelihood
##       Drugs Methylation        mRNA   Mutations 
##  "gaussian"  "gaussian"  "gaussian" "bernoulli" 
## 
## $learnIntercept
## [1] TRUE
## 
## $numFactors
## [1] 25
## 
## $sparsity
## [1] TRUE
## 
## $covariates
## NULL

3.1.3 Define training options

The most important options the users can define are:

  • maxiter: Maximum number of iterations, ideally set it large enough and use the convergence criteria
  • tolerance: Convergence threshold based on change in the evidence lower bound. We recommend you stick to the default value.
  • DropFactorThreshold: Threshold on fraction of variance explained to define inactive factors. That is, factors explaining less than ‘DropFactorThreshold’ percentage of variation in all views will be dropped during training.
  • learnFactors: Should the number of factor be learnt during training? (default YES). If this is disabled, the model will keep all factors as defined in ModelOptions$numFactors.
  • seed: A seed for model initilization to be able to reproduce the exact same fit.
TrainOptions <- getDefaultTrainOptions()
TrainOptions
## $maxiter
## [1] 5000
## 
## $tolerance
## [1] 0.1
## 
## $learnFactors
## [1] TRUE
## 
## $DropFactorThreshold
## [1] 0.02
## 
## $verbose
## [1] FALSE
## 
## $seed
## NULL

3.1.4 Define data options

Options on the data

DataOptions <- getDefaultDataOptions()
DataOptions
## $delimiter
## [1] "\t"
## 
## $centerFeatures
## [1] FALSE
## 
## $scaleViews
## [1] FALSE
## 
## $removeIncompleteSamples
## [1] FALSE

3.1.5 Prepare MOFA

prepareMOFA internally performs a set of sanity checks, fills the TrainOpts and ModelOpts slots of the MOFAmodel object and it also creates a set of temporary files with the input matrices that will be loaded by the Python core implementation

MOFAobject <- prepareMOFA(MOFAobject, 
                      DirOptions = DirOptions,
                      ModelOptions = ModelOptions,
                      TrainOptions = TrainOptions,
                      DataOptions = DataOptions
)
## Checking data options...
## Checking training options...
## Checking model options...
## Storing input views in /var/folders/2b/w71l3_4s4_15g0rrd9lczpzw0000gn/T//RtmpCZZh2A...

3.1.6 Run MOFA

This step can take some time (around 40 min with default parameters), for illustration we provide an existing trained model. mofaPath specifies the path to the mofa installation (e.g. mofaPath = “/Users/XY/anaconda2/bin/mofa”).

# not run
MOFAobject <- runMOFA(MOFAobject, DirOptions, mofaPath = "/Users/bvelten/anaconda2/bin/mofa")

3.2 Model selection

We recommend running MOFA multiple times on a dataset and selecting a final model for analysis based on the ELBO statistics as stored in each trained MOFA object. Here, we load the trained models as resulting from the last step, explore their consistency and select a model for downstream analyses.

3.2.1 Load trained models

In total, 25 models were trained on patients on the CLL data with the default options. In particular, the number of factors and an intercept term is learned, dropping inactive factors at a threshold of 2% explained variation. For the methylation M-values, the normalized expression values and the durg responses we used a Gaussian likelihood, the mutations are modelled by a Bernoulli likelihood (see detail model and training parameters above)

Here we show the steps to select the final model. They are not evaluated but once several models have been trained, mofaDir can be set to their directory and the following chunks can be run.

mofaDir <- "../data/MOFA_fits/"
files <- list.files(mofaDir)
files <- files[grepl(".hdf5",files)]

# load all trained MOFA object
AllModels <- lapply(files, function (filenm){
  modeltmp <- loadModel(file.path(mofaDir, filenm), sortFactors = T)
})
names(AllModels) <- sub(".hdf5","", files)

3.2.2 Optional: Change feautre names of the models to nicer labels

featureNames and factorNames can be used to change the names in the MOFAobject.

# get gene annotations from ENSEMBL
mRNA_file <- "../data/Hsapiens_genes_BioMart.75.txt"
mRNA = read.csv(file=mRNA_file,header=T,sep="\t",stringsAsFactors=F)

# get drug annotations from BloodCancerMultiOmics2017 package
data("drugs", package = "BloodCancerMultiOmics2017")
# set nice feature names
AllModels <- lapply(AllModels, function(model){
  #mRNA
  featureNames(model)$mRNA <- mRNA$symbol[match(featureNames(model)$mRNA, mRNA$ens_id)]
  # Drugs
  featureNames(model)$Drugs <- paste(drugs[substr(featureNames(model)$Drugs,1,5),"name"],
                                              substr(featureNames(model)$Drugs,6,8), sep="")
  # Mutations
  featureNames(model)$Mutations[featureNames(model)$Mutations=="del13q14_any"] <- "del13q14"
  model
})

3.2.3 Check the consistency of the results from different trials

As the optimization landscape is non-convex, the results from different trials will vary depending on the random initilizations. Here, we test the consistency of factors across trials and visualize the number of inferred factors and the ELBO for the different models.

compare_models(AllModels)
compare_factors(AllModels)

3.2.4 Select a model based on highest ELBO

Based on the ELBO we select the best model for downstream analyses using the select_model function.

MOFAobject <- select_model(AllModels)
MOFAobject

The model selected model in the manuscript is also part of the MOFAtools package and can be directly loaded as follows. Note that this model has depreceated parameters compared to the current MOFA version. Its dropping parameter correspond to 2% in the updated version.

# not run
filepath <- system.file("extdata", "CLL_model.hdf5", package = "MOFAtools")
MOFAobject <- loadModel(filepath, MOFAobject, r2_threshold=0.02)

  # set nice names
  featureNames(MOFAobject)$mRNA <- mRNA$symbol[match(featureNames(MOFAobject)$mRNA, mRNA$ens_id)]
  featureNames(MOFAobject)$Drugs <- paste(drugs[substr(featureNames(MOFAobject)$Drugs,1,5),"name"],
                                              substr(featureNames(MOFAobject)$Drugs,6,8), sep="")
  featureNames(MOFAobject)$Mutations[featureNames(MOFAobject)$Mutations=="del13q14_any"] <- "del13q14"

3.3 Other diagnostic checks of the trained model

3.3.1 Robustness of factors to downsampling of samples (Appendix Figure S7)

See here.

3.3.2 Correlation of latent factors

fout <- plotFactorCor(MOFAobject)

3.3.3 Impact on factors when training on a subset of views (Appendix Figure S8)

See here.

3.4 Further preparations

3.4.1 Flip factor 1

AS factors are rotationally invariant, weights and factor values can only interpreted relative to one another. To align factors with other covariates, we can flip them by changin the sign of the factor values and their weights, as is done here for Factor 1 to align it with the IGHV status in terms of patient’s hazard ratio.

MOFAobject <- MOFAtools:::flip_factor(MOFAobject, "1")

3.4.2 Extracting important parts of the model and covariates

For ease-of-use we extract some interesting components of the fitted model

# Training data
data <- getTrainData(MOFAobject)
# Number of factors
K <- getDimensions(MOFAobject)$K
# Factor matrix (N x K)
Z <- getFactors(MOFAobject)
# List of weight matrics
weights <- getWeights(MOFAobject)

3.4.3 Extracting external information on the samples and features

The BloodCancerMultiOmics2017 R package contains further data on the samples and feautres that was not used in training and can be helpful for annotation of the factors and downstream analyses such as prediction of clinical outcomes.

data("patmeta", package = "BloodCancerMultiOmics2017")
recurrent_muts <- t(data$Mutations[rowSums(data$Mutations, na.rm = T) >=5, ])
covariates <- cbind(recurrent_muts[,!grepl("IGHV", colnames(recurrent_muts))], patmeta[rownames(Z),])
colnames(covariates)[grep("Gender",colnames(covariates) )] <- "sex"
covariates$sex <- ifelse(covariates$sex == "m", 0,1)
colnames(covariates)[grep("Age4Main",colnames(covariates) )] <- "age"
colnames(covariates)[grep("ConsClust",colnames(covariates) )] <- "MethylationCluster"
covariates <- covariates[, !grepl("T5|T6|Age4Pilot|Age4Main|Diagnosis|died|treated|treatedAfter", colnames(covariates) )]
covariates$missingViews <- rowSums(sapply(data, function(dat) apply(is.na(dat), 2, all)))
covariates$patID <- rownames(covariates)

4 Step 3: Overview of the trained MOFAobject (as in Figure 2)

After training, we can explore the results from MOFA. Here we provide a semi-automated pipeline to disentangle and characterize the sources of variation (the factors) identified by MOFA. Part 1: Disentangling the heterogeneity: - Calculation of variance explained by each factor in each view Part 2: Characterisation of individual factors: - Inspection of top weighted features in the active views - Ordination of samples by factors to reveal clusters and graadients in the sample space - Feature set enrichemnt analysis in the active views (where set annotations are present, e.g. gene sets for mRNA views)

For details, please read the Methods section of our paper.

4.1 R2 decomposition (Figure 2b,c)

plotVarianceExplained(MOFAobject)

r2out <- calculateVarianceExplained(MOFAobject)
# flipped version as presented in the manuscript
p <- plotFlippedR2(MOFAobject, r2.out)
print(p)

4.2 Weights on Mutations view for the first two factors (Figure 2d)

For small views: Direct inspection of weight is a good way to annotate factors, e.g. here for facotrs 1 and 2 acive in the Mutations view.

# Mutation weights on factor 1
showTopWeightsAndColor(MOFAobject, "Mutations", "1" , nfeatures = 5,
                       abs=T, Features2color = "IGHV",
                      maxL = 1, scalePerView=T)

# Mutation weights on factor 2
showTopWeightsAndColor(MOFAobject, "Mutations", "2" , nfeatures = 5,
                       abs=T, Features2color = "trisomy12",
                       maxL = 1, scalePerView=T)

4.3 Scatterplot of two most important facotrs (Figure 2e)

# collect data for plot
df = data.frame(x = Z[, "1"], y = Z[,"2"], 
                trisomy12 = covariates$trisomy12,
                IGHV = covariates$IGHV)

# nice names for tr 12
df$trisomy12[is.na(df$trisomy12)] <- "missing" 
df$trisomy12[df$trisomy12 == "0"] <- "wt" 
df$trisomy12[df$trisomy12 == "1"] <- "tr12" 
df$trisomy12 <- factor(df$trisomy12, levels=c("wt", "tr12", "missing"))

# nice names for IGHV
df$IGHV[is.na(df$IGHV)] <- "missing" 
df$IGHV[df$IGHV == "0"] <- "U" 
df$IGHV[df$IGHV == "1"] <- "M" 
df$IGHV <- factor(df$IGHV, levels=c("U", "M", "missing"))

# nice names for combi
df$IGHV_tr12 <- paste(df$IGHV, df$trisomy12, sep="-CLL, ")
df %<>% mutate(IGHV_tr12= ifelse(grepl("missing", IGHV_tr12), "missing",IGHV_tr12))

# colors and shapes for plot
Paircolorr <- c(RColorBrewer::brewer.pal(6, "Paired")[c(1:2,5:6)], "grey")
Shapes4plot <- c(17,19,17,19, 3)
names(Shapes4plot) <- names(Paircolorr) <- c("U-CLL, wt", "U-CLL, tr12", "M-CLL, wt", "M-CLL, tr12", "missing")
df$IGHV_tr12 <- factor(df$IGHV_tr12, levels=names(Paircolorr))

# make plot
titlesize <- 15
gg <-  ggplot(df, aes(x=x, y=y, col=IGHV_tr12,shape=IGHV_tr12)) +
  geom_point(size=2.5)  +
  theme(plot.margin = margin(20, 20, 10, 10), 
        axis.text = element_text(size = rel(1), color = "black"),
        axis.title = element_text(size = titlesize), 
        axis.title.y = element_text(size = rel(1.1),
                                        margin = margin(0, 10, 0, 0)), 
        axis.title.x = element_text(size = rel(1.1),
                                        margin = margin(10, 0, 0, 0)), 
        axis.line = element_line(color = "black", size = 0.5), 
        axis.ticks = element_line(color = "black", size = 0.5),
        panel.border = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        panel.background = element_blank(),
        legend.key = element_rect(fill = "white"),
        legend.text = element_text(size = titlesize),
        legend.title = element_blank(),
        legend.position = "top") +
  scale_shape_manual(name = "Mutational status",
                     labels= names(Shapes4plot),
                     values=Shapes4plot) +
  scale_color_manual(name = "Mutational status",
                     labels= names(Paircolorr),
                     values=Paircolorr) +
  guides(colour = guide_legend(ncol=3),
          shape = guide_legend(ncol=3)) +
  xlab("Factor 1") + ylab("Factor 2") 

print(gg)

4.4 Enrichment Analysis (Figure 2f)

We used the reactome gene set for enrichment analysis. Other gene sets can be used as input to FeatureSetEnrichmentAnalysis.

# Get reactome gene sets
data("reactomeGS", package="MOFAtools")

# use ensIDs instead of gene symbols
# use ensIDs
if(!all(grepl("ENS", MOFAtools::featureNames(MOFAobject)$mRNA))) {
    symbols <- MOFAtools::featureNames(MOFAobject)$mRNA
    MOFAtools::featureNames(MOFAobject)$mRNA <- mRNA$ens_id[match(MOFAtools::featureNames(MOFAobject)$mRNA, mRNA$symbol)]
}

Calculate Enrichment

gsea.out <- FeatureSetEnrichmentAnalysis(MOFAobject, "mRNA",reactomeGS,
                                         statistical.test = "parametric",
                                         alpha = 0.01, min.size = 15)
names(gsea.out$sigPathways) <- colnames(Z)[-1]

# set feature names back to gene symbols
MOFAtools::featureNames(MOFAobject)$mRNA <- symbols

Make a barplot of enriched gene sets per factor (Figure 2f). For color annotations we define broad categories of gene sets manually.

#broader categories
label_pathway <- function(x){
  ifelse(grepl("[s|S]tress|HSF|[S|s]enescence|[T|t]elomer|Attenuation phase", x), "stress_aging",
         ifelse(grepl("egulat|Polymerase", x) & grepl("RNA", x), "RNA_regulation",
                ifelse(grepl("Immun|TCR|Interleukin|IL",x), "ImmuneResponse", "other")))
}
categories <- sapply(rownames(reactomeGS), label_pathway)

# write categories to .csv files
# write.csv(file=file.path("stress_aging.csv"),
#           names(categories[which(categories=="stress_aging")]))
# write.csv(file=file.path("ImmuneResponse.csv"),
#           names(categories[which(categories=="ImmuneResponse")]))
# write.csv(file=file.path("RNA_regulation.csv"),
#           names(categories[which(categories=="RNA_regulation")]))
# write.csv(file=file.path("other.csv"),
#           names(categories[which(categories=="other")]))

# color definition
col4Pathways <- c("other"="gray",
                  "cellular stress/senescence"="cyan",
                  "RNA regulation"="navy",
                  "Immune Response"="forestgreen")

nicelabels <- c(other="other",
                stress_aging="cellular stress/senescence",
                ImmuneResponse="Immune Response",
                RNA_regulation="RNA regulation" )

Barplot of enriched gene sets on each factor

# collect results in dataframe
pathwaysDF <- melt(gsea.out$sigPathways, value.name="pathway")
colnames(pathwaysDF) <- c("pathway", "factor")
pathwaysDF %<>% mutate(type=label_pathway(pathway))

# summarize per factor
pathwaysSummary <- pathwaysDF %>% dplyr::group_by(factor) %>%
  dplyr::summarise(other = sum(type=="other"),
                   stress_aging=sum(type=="stress_aging"),
                   ImmuneResponse = sum(type == "ImmuneResponse"),
                   RNA_regulation = sum(type=="RNA_regulation"))

# add factors without enrichment
df_none <- data.frame(factor = colnames(Z)[-1][!colnames(Z)[-1] %in% pathwaysSummary$factor],
                      other = 0, stress_aging=0,
                      ImmuneResponse=0, RNA_regulation=0)
pathwaysSummary <- rbind(pathwaysSummary, df_none)
pathwaysSummary %<>%  melt(id.vars = "factor", variable.name = "type", value.name = "count")
pathwaysSummary$factor <- factor(pathwaysSummary$factor, levels = 1:K)
pathwaysSummary %<>% mutate(type=as.character(nicelabels[type]))
pathwaysSummary$type <- factor(pathwaysSummary$type, levels=rev(as.character(nicelabels)))

ggplot(pathwaysSummary, aes(x=factor, y=count, fill=type)) +
  geom_bar(stat="identity") +
  ylab("Enriched gene sets at FDR 1%") +
  xlab("Factor") +
  scale_fill_manual(values=col4Pathways) +
  theme(legend.position = "top") +
  guides(fill=guide_legend(title="",nrow=2))

5 Further analyses

5.1 In-depth analysis of selected MOFA factors

5.1.1 Factor 1 (Figure 3)

5.1.1.1 Clustering of patients based on Factor 1 (Figure 3a)

Check which K is optimal in K-means clustering using a BIC criterion.

bic <- Optimal_Clusters_KMeans(Z[,"1", drop=F], 7, criterion="BIC")

df_bic <- data.frame(BIC = bic[1:length(bic)], K=1:length(bic))
ggplot(filter(df_bic, K>1), aes(x=K, y=BIC)) +geom_line(linetype="dashed") +geom_point() + theme_bw()

Based on the value of factor 1 samples are classified into 3 groups using kmeans.

# kmeans to determine 3 factor clusters
set.seed(32180)
ZMC <- kmeans(Z[,"1"], 3, nstart=1000, iter.max = 1000)
ZMC <- ZMC$cluster
ZMC <- ifelse(ZMC==1, "HZ", ifelse(ZMC==2, "LZ", "IZ"))

# Comparison to kwnon sample groups based on the methylation cluster
# methylation cluster
MC <- covariates[, "MethylationCluster"]
MC[is.na(MC)] <- "missing"
names(MC) <- rownames(covariates)

table(MC, ZMC)
##          ZMC
## MC        HZ IZ LZ
##   HP      75  0  0
##   IP       2 42  1
##   LP       0  0 76
##   missing  1  2  1

Beeswarm plot (Fig. 3a and Appendix Figure S11)

col4Clusters <- c(LZ="navy", IZ="darkgoldenrod", HZ="darkred")
MCcolors <- col4Clusters[ZMC]

#make plot
par(mar=c(2.3, 4.5, 4, 2), xpd=TRUE)
bs <- beeswarm(Z[,"1"], pwcol = MCcolors, pch = 16,
          ylab = paste("Factor", "1"), xlab = "",
          cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)

legend("top", legend = c("LZ", "IZ", "HZ"),
       title = "Factor clusters", pch = 16,
       col = col4Clusters, ncol = 3,
       inset=c(0,-0.2), cex=1.2,
       box.lwd = 0, box.col = "white")

As a comparison: Beeswarm plot of factor 1 with samples colored based on their Methylation Cluster. (To compare with the clusters based on factor 1)

# get methylation cluster and define colors
MC <- covariates[,"MethylationCluster"]
MC[is.na(MC)] <- "missing"
MC <- factor(MC, levels = c("LP", "IP", "HP", "missing"))
names(MC) <- rownames(covariates)
col4MC <- c(LP="navy", IP="darkgoldenrod", HP="darkred", "missing"="gray")
MCcolors <- col4MC[MC]

#make plot
par(mar=c(2.3, 4.5, 4, 2), xpd=TRUE)
bs <-beeswarm(Z[,"1"], pwcol = MCcolors, pch = 16,
          ylab = paste("Factor", "1"), xlab = "",
          cex.lab=1.5, cex.axis=1.5,
          cex.main=1.5, cex.sub=1.5)

legend("top", legend =  levels(MC)[1:3],
       title = "Methylation Cluster", pch = 16,
       col = col4MC[1:3], ncol = 4,
       inset=c(0,-0.2), cex=1.2,
       box.lwd = 0, box.col = "white")

Samples breakdown by number of views

stopifnot(all(names(ZMC)==names(MC)))
df_MC <- data.frame(MC=MC, ZMC=ZMC, patID=names(MC))
df_MC <- cbind(df_MC,t(sapply(df_MC$patID, function(p) sapply(MOFAobject@TrainData, function(l) !all(is.na(l[,p]))))))

table(df_MC$ZMC, df_MC$Drugs)
##     
##      FALSE TRUE
##   HZ     7   71
##   IZ     0   44
##   LZ     9   69
table(df_MC$ZMC, df_MC$mRNA)
##     
##      FALSE TRUE
##   HZ    29   49
##   IZ    15   29
##   LZ    20   58
table(df_MC$ZMC, df_MC$Methylation)
##     
##      FALSE TRUE
##   HZ     1   77
##   IZ     2   42
##   LZ     1   77
table(df_MC$ZMC, df_MC$Mutations)
##     
##      TRUE
##   HZ   78
##   IZ   44
##   LZ   78
table(df_MC$ZMC)
## 
## HZ IZ LZ 
## 78 44 78

5.1.1.2 Characterization in the mRNA view (Figure 3b,c)

Top Weights

# number of top genes to show
ntop_mRNA <- 12

# list of previously mentioned IGHV associated genes (see references in ms)
knownGenes <- c("AKAP12", "ADAM29", "BCL7A", "CLECSF2", "FCGBP", "FLJ10884",
                "FUT8", "LPL", "TCF7", "WNT3", "APOD", "SPG20", "MYL9", "NRIP1",
                "SPAP1", "SPRY2", "TGFBR3", "ZAP70", "COBLL1", "ZNF667", "SEPT10",
                "CRY1", "PLD1", "BCL7A", "WNT9A")

# plot top weights
p <- showTopWeightsAndColor(MOFAobject, "mRNA", "1",
                            nfeatures = ntop_mRNA,
                            Features2color = knownGenes,
                            scalePerView = T,
                            col2highlight = "darkorange",
                            orderBySign = TRUE)
p

Data Heatmap

# get top genes and patient having RNAseq
topGenes <- names(sort(abs(MOFAobject@Expectations$W$mRNA[,"1"]), decreasing = T))[1:ntop_mRNA]
topGenes <- topGenes[order(MOFAobject@Expectations$W$mRNA[topGenes,"1"], decreasing = T)]
patients2include <- colnames(data$mRNA)[apply(data$mRNA,2, function(p) !any(is.na(p)))]

# annotate by IGHV factor 
anno_df <- data.frame(Z=Z[,"1"], ZMC=ZMC)
colnames(anno_df) <- c("Factor 1", "Clusters")
rownames(anno_df) <- rownames(Z)
annoHM_colors <- list(c("blue", "red"), col4Clusters)
names(annoHM_colors) <- c("Factor 1", "Clusters")

# heatmap
pheatmap(data$mRNA[topGenes, patients2include[order(Z[patients2include, "1"])] ],
         show_colnames = F, cluster_rows = F, cluster_cols = F, fontsize = 18,
         annotation_col = anno_df, annotation_legend = F,
         gaps_col = c(which(ZMC[names(sort(Z[patients2include,"1"], decreasing = F))]=="IZ")[1]-1,
                      which(ZMC[names(sort(Z[patients2include,"1"], decreasing = F))]=="HZ")[1]-1),
         show_rownames = T, legend = T, annotation_colors = annoHM_colors)

5.1.1.3 Characterization in the methylation view

Top Weights

# number of top sites to show
ntop_meth <- 12

# plot top weights
p <- showTopWeightsAndColor(MOFAobject, "Methylation", "1",
                            nfeatures = ntop_meth,
                            scalePerView = T)
p

Data Heatmap

# get top sites and patient having methylation data
topSites <- names(sort(abs(MOFAobject@Expectations$W$Methylation[,"1"]), decreasing = T))[1:ntop_meth]
patients2include <- colnames(data$Methylation)[apply(data$Methylation,2, function(p) !any(is.na(p)))]

# annotate by IGHV factor 
anno_df <- data.frame(Z=Z[,"1"], ZMC=ZMC)
colnames(anno_df) <- c("Factor 1", "Clusters")
rownames(anno_df) <- rownames(Z)
annoHM_colors <- list(c("blue", "red"), col4Clusters)
names(annoHM_colors) <- c("Factor 1", "Clusters")

# heatmap
pheatmap(data$Methylation[topSites, patients2include[order(Z[patients2include, "1"])] ],
         show_colnames = F, cluster_rows = F, cluster_cols = F, fontsize = 18,
         annotation_col = anno_df, annotation_legend = F,
         gaps_col = c(which(ZMC[names(sort(Z[patients2include,"1"], decreasing = F))]=="IZ")[1]-1,
                      which(ZMC[names(sort(Z[patients2include,"1"], decreasing = F))]=="HZ")[1]-1),
         show_rownames = T, legend = T, annotation_colors = annoHM_colors)

5.1.1.4 Characterization in the drug response view (Figure 3d,e)

For drugs we show the average weight across all concentrations

plotDrugWeights(MOFAobject, "1", ntop=15, broad_categories = TRUE)

Drug Response Curves

  data(conctab, package="pace")
  groups <- ZMC
  groupnm <- "clusters"
  drugResDF <- lapply(c("dasatinib", "AZD7762"), function(drugnm) {
    drugData2plot <-MOFAobject@TrainData$Drugs[grepl(drugnm,rownames(MOFAobject@TrainData$Drug)),]
    drugid <- rownames(drugs[drugs$name==drugnm, ])
  
    drugDF <- melt(drugData2plot, varnames = c("drug", "patient"), value.name = "viability")
    drugDF %<>% mutate(concentrationID = as.numeric(sapply(as.character(drug), function(x) strsplit(x, "_")[[1]][2])))
    drugDF %<>% mutate(concentration = as.numeric(conctab[drugid,paste0("c", concentrationID)]))
    if(!is.null(groups)) drugDF %<>% mutate(group = as.factor(groups[patient])) else drugDF$group <- factor(1)

    drugDF %<>% filter(!is.na(viability) & !is.na(group))
    summary_drugDF <-  drugDF %>% group_by(group, concentrationID, concentration) %>%
            dplyr::summarize(mean_viab = mean(viability), sd = sd(viability), n = length(viability))
    summary_drugDF$se <- summary_drugDF$sd/sqrt(summary_drugDF$n)
    summary_drugDF$drug <- drugnm
    summary_drugDF
}) %>% bind_rows()
  
  p <- ggplot(drugResDF, aes(x=concentration, y=mean_viab, col=group, grou=group)) +
    geom_errorbar(aes(ymin=mean_viab-2*se, ymax=mean_viab + 2*se), width=0.1)+ geom_line(size=2) +
    ylab("viability") +theme_bw(base_size = 21) + facet_wrap(~drug)+
    xlab(expression(paste("Concentration [",mu,"M]"))) #+ scale_x_reverse()
  if(is.null(groups)) p <- p + guides(col=F) else p <- p + guides(col=guide_legend(title =groupnm))
  
  p <- p + scale_color_manual(values = col4Clusters, labels=c("LZ", "IZ", "HZ")) +
    ylim(c(0,1.05)) + theme(legend.position = "top", axis.text = element_text(colour="black")) 
  
  print(p )

5.1.1.5 Continuity and robust inference based on a subset of views (Appendix Figure S12, S13)

see here

5.1.1.6 Prediction of IGHV status based on factor 1 (EV Figure 3)

Use the Factor 1 of MOFA to classify patients into 2 groups

# IGHV status
IGHV <- covariates[,"IGHV"]
IGHV[is.na(IGHV)] <- "missing"
names(IGHV) <- rownames(covariates)

# cluster using kmeans to determine 2 IGHV factor/cluster
set.seed(32180)
ZIGHV <- kmeans(Z[,"1"], 2, nstart=1000, iter.max = 1000)
ZIGHV <- ZIGHV$cluster

# label clusters as U and M-CLL
IGHV[IGHV==0] <- "U"
IGHV[IGHV==1] <- "M"
ZIGHV[ZIGHV==1]<-"M"
ZIGHV[ZIGHV==2]<-"U"

# collect labels and factor values in data-frame
df_clustering <- data.frame(IGHV_label = as.factor(IGHV),
                            IGHV_predicted = ZIGHV, 
                            patID = rownames(Z), Z = Z[,"1"])


#check agreement between clincal IGHV label and cluster results
clagree <- paste(ZIGHV ,IGHV, sep="_")
clagree <- ifelse(clagree %in% c("M_M", "U_U"), "agreement with label",
                  ifelse(grepl("missing",clagree), "label missing",
                         ifelse(clagree=="M_U", "U-CLL clustered as M-CLL", "M-CLL clustered as U-CLL" )))
df_clustering$agreement <- clagree

Define colors

#colors
colors_agreemet <- c("agreement with label" = "darkgreen", 
                               "label missing" ="gray", 
                               "U-CLL clustered as M-CLL" ="sienna",
                               "M-CLL clustered as U-CLL" = "orange")

#colors for heatmap annotations
anno_colors<- list(IGHV_label = c("M"="red", "U"="blue", "missing" = "gray"),
                   IGHV_predicted = c("M"="darkred", "U"="darkblue"),
                  agreement =colors_agreemet)

Beeswarm plots for agreement of factor groups with clincial IGHV label (Figure EV 3a)

# use colors defined above
col4bees <- anno_colors$agreement[df_clustering$agreement]

gg_bees <- ggplot(df_clustering, aes(x=1,y=Z, col=agreement)) + geom_beeswarm(size=2) +
  scale_color_manual(values=colors_agreemet) + 
  theme(axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        text = element_text(size=18))+
  guides(col=F) + ylab("Factor 1")
gg_bees

Pie plot of IGHV label agreement (Figure EV 3b)

df_pie <- df_clustering %>%
  group_by(agreement) %>%
  dplyr::summarize(value=length(patID))
df_pie$agreement <- factor(df_pie$agreement, levels = rev(df_pie$agreement))

gg_pie <- ggplot(df_pie, aes(x="", y=value, fill=agreement)) +
  geom_bar(stat="identity",width = 1)+
  coord_polar("y", start=0)+
  # scale_fill_brewer(palette="Dark2")+
  theme_minimal() + xlab("") + ylab("")+
  scale_fill_manual(values=colors_agreemet) +
  # ggtitle("Classification of IGHV status") +
  theme(text=element_text(size=22),
        legend.position = "bottom",
        axis.text = element_blank(),
        axis.line = element_blank()) +
  guides(fill=F,
         col=F) +
  geom_text(aes(y = cumsum(value)-value/2, x=1.7, label=value, col=agreement), size=6) +
  scale_color_manual(values=colors_agreemet)
gg_pie

Heatmaps of omic layers annotated with IGHV classification and Factor classification

  1. Methylation
methData <- data$Methylation
methData <- methData[,apply(methData,2, function(d) !all(is.na(d)))]
methHmLeg <- pheatmap(cor(methData),
         annotation_row = select(df_clustering, c(IGHV_label, IGHV_predicted, agreement)),
         show_rownames = F, show_colnames = F, main = "Methylation",
         annotation_colors  =  anno_colors, annotation_legend = T, treeheight_col=0
         )

  1. Drugs
drugData <- data$Drugs
drugData <- drugData[,apply(drugData,2, function(d) !all(is.na(d)))]
drugHmLeg <- pheatmap(cor(drugData),
         annotation_row = select(df_clustering, c(IGHV_label, IGHV_predicted, agreement)),
         show_rownames = F, show_colnames = F, main = "Drug response", legend = T,
         annotation_colors  =  anno_colors, annotation_legend = T, treeheight_col=0,
        annotation_names_row=F, treeheight_row = 0
         )

  1. mRNA
mRNAData <- data$mRNA
mRNAData <- mRNAData[,apply(mRNAData,2, function(d) !all(is.na(d)))]
mRNAHm <- pheatmap(cor(mRNAData),
         annotation_row = select(df_clustering, c(IGHV_label, IGHV_predicted, agreement)),
         show_rownames = F, show_colnames = F, main = "mRNA",
         annotation_colors  =  anno_colors, annotation_legend = F, treeheight_col=0,
         annotation_names_row=F, treeheight_row = 0
         )

5.1.2 Factor 3 (Appendix Figure S15)

This facotr is mainly acitve in the drug response view (see variance decomposition) #### Weights on drug responses

Nearly all drug-consentration pairs have positive weight on this factor.

ggW <- MyplotWeights(MOFAobject, "Drugs", "3", nfeatures = 0) +
  geom_hline(yintercept = 0, color="darkred", linetype="dashed") 

5.1.2.1 Comparison to general drug level sensitivity

dd <- data$Drugs
df_GDLS <- data.frame(GLDS= colMeans(dd), Factor.3 = Z[,"3"])
ggGLDS <- ggplot(df_GDLS, aes(x=Factor.3, y=GLDS)) + geom_point() +geom_smooth(method = 'lm') +
  annotate("text",x=-3,y=0.8,label=paste("cor=",round(cor(df_GDLS$GLDS,df_GDLS$Factor.3, use="complete.obs"),2)), col="blue", cex=5)+
  xlab("Factor 3") + ylab("General drug sensitivity")

5.1.2.2 Arrange plot

plot_grid(ggW, ggGLDS, align = "hv", axis = "l", nrow=2, labels=c("a","b"))
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing missing values (geom_point).

5.1.3 Factor 4 (Appendix Figure S14)

5.1.3.1 Results from gene set enrichment analysis for factor 4.

df <- as.data.frame(gsea.out$pval.adj[,4])
colnames(df) <- c("LF_4")
df$pathway <- rownames(df)
rownames(df) <- NULL
df_sig <-filter(df, LF_4<0.001)
df_sig$pathway <- factor(df_sig$pathway, levels=df_sig$pathway[order(df_sig$LF_4, decreasing = T)])
df_sig <- filter(df_sig, pathway %in% tail(levels(df_sig$pathway),10))
gg_gsea <- ggplot(df_sig, aes(x=pathway, y=-log(LF_4))) + geom_point(position=position_dodge(.5)) +
  geom_linerange(aes(ymin=0,ymax=-log(LF_4)), linetype = "dashed",position=position_dodge(.5))+
  geom_hline(yintercept = -log10(0.01), linetype = "longdash") + 
  coord_flip() + ylab("-log p-value") + 
  scale_x_discrete(position = "top") +
  theme(plot.margin=margin(t=0,r=0,b=0,l=0.5,"cm"))+
  xlab("")
gg_gsea

5.1.3.2 Heatmap showing the expression top weighted gene

getTopWeights <- function(MOFAobject, view, factor, n=20){
  rownames(getWeights(MOFAobject, view,factor)[[1]])[order(abs(getWeights(MOFAobject, view, factor)[[1]]), decreasing = T)][1:n]
}
topWeights_mRNA_4 <- getTopWeights(MOFAobject, "mRNA", "4", 10)

anno_df <- data.frame("Factor 4"=Z[,"4"])
mRNA_4_hm <- plotDataHeatmap(MOFAobject, "mRNA", 4, transpose = T,
                                features= topWeights_mRNA_4,
                                sortSamples = T, 
                                annotation_col=anno_df, cluster_cols=F,
                                cluster_rows=TRUE,
                                treeheight_row = 0,
                                show_colnames=F)$gtable

gg_mRNA_4_hm <- as_ggplot(mRNA_4_hm)+theme(plot.margin=margin(t=0,r=-2,b=0,l=0.5,"cm"))

5.1.3.3 Scatterplots wit surface markers

df <- data.frame(CD8A =  MOFAobject@TrainData$mRNA["CD8A",],
                 CD8B =  MOFAobject@TrainData$mRNA["CD8B",],
                 CD3D =  MOFAobject@TrainData$mRNA["CD3D",],
                 CD300E = MOFAobject@TrainData$mRNA["CD300E",],
                 LF4= Z[,"4"])
df <- gather(df, value=value, key=gene, -LF4)
df_cor <- group_by(df, gene) %>% dplyr::summarize(cor=round(cor(LF4,value, use="complete.obs"),2),
                                                  xpos = quantile(value, 0.8,na.rm=T))
gg_cds <- ggplot(df, aes(x= value, y= LF4)) +geom_point()+
 facet_wrap(~gene, scales="free_x", nrow=1) + geom_smooth(method="lm")+
  geom_text(data = df_cor, aes(x = xpos, y = 0.5, label = paste("cor=",cor)), col="blue")+
  xlab("Normalized expression") + ylab("Factor 4")

Arrange plot

plot_grid(gg_gsea, gg_mRNA_4_hm, gg_cds, ncol=1, labels=letters[1:3], align="hv", axis="l", label_size = 24)
## Warning: Removed 256 rows containing non-finite values (stat_smooth).
## Warning: Removed 256 rows containing missing values (geom_point).

5.1.4 Factor 5 (EV Figure 2)

5.1.4.1 Beeswarm

plotFactorBeeswarm(MOFAobject, factors="5", color_by="TNF", showMissing = F) +ylab("Factor 5")

5.1.4.2 GSEA

LinePlot_FeatureSetEnrichmentAnalysis(gsea.out, "5", max.pathways = 7)

5.1.4.3 Gene expression

df <- getFactors(MOFAobject, "5", as.data.frame = TRUE)
anno <- mutate(df, Factor.5=value)
rownames(anno) <- anno$sample
anno <- select(anno, "Factor.5")

plotDataHeatmap(MOFAobject, view="mRNA", factor="5", features=6, transpose=T,
                cluster_cols=F, cluster_rows=F, main="", show_rownames=T, show_colnames=F, annotation_col=anno)

5.1.4.4 Drug response wieghts

plotDrugWeights(MOFAobject, "5", ntop = 9)

5.1.4.5 Drug response values

anno_df <- getFactors(MOFAobject, "5", as.data.frame = TRUE)
anno_df %<>% mutate(Factor.5 = value)
rownames(anno_df) <- anno_df$sample
anno_df %<>% select(Factor.5)

col <- brewer.pal(n = 9, name = "YlGn")

plotDataHeatmap(MOFAobject, factor = "5", features = c("SD51_4","SD07_3","MIS-43_4"), view = "Drugs", transpose = TRUE,
                col=col, annotation_col=anno_df,
                show_rownames = T, show_colnames = F,
                cluster_rows = F, cluster_cols = F)

5.1.5 Factor 7 (Appendix Figure S18)

5.1.5.1 Beeswarm

df = getFactors(MOFAobject, "7", as.data.frame = T)
df <- df[complete.cases(df),]
rownames(df) <- df$sample
df$del17p13 <- covariates[df$sample,"del17p13"]

par(mar=c(2.3, 4.5, 4, 2), xpd=TRUE)
colors <- colorRampPalette(rev(brewer.pal(n = 5, name = "RdYlBu")))(2)[as.numeric(cut(df$del17p13,breaks = 2))]
bs <- beeswarm(df$value, pwcol = colors, pch = 16, ylab = "Factor 7", xlab = "", cex.lab=1.5, cex.axis=1.5, cex.main=1.5, cex.sub=1.5)
legend("top", legend = c("(-) del17p13", "(+) del17p13"),
       title = "", pch = 16,
       col = unique(colors), ncol = 1,
       cex=1.5, inset = -0.28,
       box.lwd = 0, box.col = "white")

5.1.5.2 Methylation

# connect gene names for CpGs
MOFAobject_namedCpG <- MOFAobject
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg27150870")] <- "ANKRD11"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg00981070")] <- "PRKC-Z (1)"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg10057528")] <- "PRKC-Z (2)"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg09179248")] <- "PRKC-Z (3)"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg04071758")] <- "CREBBP (1)"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg04336433")] <- "CREBBP (2)"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg14703784")] <- "RASA3"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg15032490")] <- "PAK1"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg22245494")] <- "SYNM"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg05277504")] <- "ASPSCR1"
featureNames(MOFAobject_namedCpG)[["Methylation"]][which(featureNames(MOFAobject_namedCpG)[["Methylation"]]=="cg18440230")] <- "ERGIC1"
gene_associated_features <- c("PRKC-Z (1)","PRKC-Z (2)","PRKC-Z (3)","CREBBP (1)","CREBBP (2)","RASA3","PAK1","SYNM","ASPSCR1","ERGIC1")

anno <- select(df, value)
plotDataHeatmap(MOFAobject_namedCpG, view="Methylation",
                factor="7",
                features=gene_associated_features,
                transpose=T, 
                show_rownames=T, show_colnames=F,
                cluster_rows=T, cluster_cols=F,
                annotation_col=anno)

5.1.5.3 Drug responses

plotDrugWeights(MOFAobject, "7", ntop=7)

5.1.5.4 Mutations

plotTopWeights(MOFAobject, "Mutations", "7", nfeatures=7, abs=T)

5.1.6 Factor 8 (Appendix Figure S19)

5.1.6.1 active views

r2out <- calculateVarianceExplained(MOFAobject)
r2df <- melt(r2out$R2PerFactor[c(8),,drop=F],
             varnames = c("Factor", "Omic"))
r2df$Factor <- as.character(r2df$Factor)
r2df$value <- pmax(r2df$value,0)
gg_r2 <- ggplot(r2df, aes(x=Omic, y=value, group=Factor, fill=Factor, label=Omic)) +
  geom_text(angle=90, col="black", nudge_y = 0.001, hjust=0)+
  geom_bar(stat="identity", position = "dodge") + ylab(bquote(R^2))+
  theme_classic() +
  scale_fill_manual(values=c("7"="darkgreen", "8"="navy")) +
  xlab("") + ylim(c(0,0.053)) + theme( axis.text.x = element_blank())+
  guides(fill="none")

5.1.6.2 Gene expression weights and enriched gene sets

getTopWeights <- function(MOFAobject, view, factor, n=20){
  rownames(getWeights(MOFAobject, view,factor)[[1]])[order(abs(getWeights(MOFAobject, view, factor)[[1]]), decreasing = T)][1:n]
}

topWeights_mRNA <- getTopWeights(MOFAobject, "mRNA", "8", 20)

gg_mRNA <- plotTopWeights(MOFAobject, "mRNA", "8", 20)
gg_mRNA <- gg_mRNA + ylab("Absolute loading on Factor 8")+ xlab("")+ theme_classic() +
  theme(plot.margin=margin(t=0.5,r=0,b=0,l=0,"cm"))


df_anno <- data.frame("Factor 8" = Z[,"8"])
rownames(df_anno) <- rownames(Z)
mRNA_hm <- plotDataHeatmap(MOFAobject, "mRNA", 8, transpose = TRUE,
                                features= topWeights_mRNA,
                                sortSamples = TRUE, 
                                annotation_col=df_anno,
                                cluster_cols=FALSE,
                                cluster_rows=FALSE,
                                # treeheight_row = 0,
                                show_colnames=F)$gtable

gg_mRNA_hm <- as_ggplot(mRNA_hm)+theme(plot.margin=margin(t=0,r=-3,b=0.7,l=0.5,"cm")) +
  theme(axis.text.y = element_blank()) + ylab("")

Enriched gene sets

df_gse <- data.frame(pvaladj=gsea.out$pval.adj[,8], pathway=rownames(gsea.out$pval.adj))
threshold=0.01
gg_gse <- ggplot(filter(df_gse, pvaladj<threshold),
                 aes(y=-log10(pvaladj), x=pathway)) +
  geom_point() + ylab("-log pvalue") +
  geom_hline(yintercept = -log10(threshold), linetype = "longdash") +
  geom_segment(aes(xend = pathway),yend = 0) +
  coord_flip() +
  theme_classic()+ xlab("") +
  scale_x_discrete(labels=c(gsub("s a", "s\na",gsea.out$sigPathways[[8]][1]),
                            gsub("d g", "d\ng",gsea.out$sigPathways[[8]][2])))

Assemble plot

gg_row1 <- plot_grid(gg_r2,gg_gse, ncol=2,
          labels = letters[1:2], rel_widths = c(0.8,1), axis="b", align="h")
gg_row2 <- plot_grid(gg_mRNA,gg_mRNA_hm, ncol=2,
          labels = letters[3:4], rel_widths = c(0.6,1))

grid.arrange(gg_row1, gg_row2, ncol=1, heights=c(0.7,1))

5.2 Association of the factors to time to treatment and prediction (Figure 4)

5.2.1 Retrieve survival data

Survival data is obtained from Dietrich, Oles, Lu et al. 2018 and the R package BloodCancerMultiOmics2017.

data(patmeta, package="BloodCancerMultiOmics2017")
survivalData <- as.matrix(patmeta[,c("T5","treatedAfter")])
survivalData<- survivalData[!is.na(survivalData[,1]),]
colnames(survivalData)<-c("time", "status")
  
# subset to common patients
commonPats <- intersect(rownames(Z), rownames(survivalData))
Zcommon <- Z[commonPats,]
survivalData <- survivalData[commonPats,]
covariatesCommon <- covariates[commonPats,]
dataCommmon <- lapply(data, function(dd) dd[,commonPats])

# construct survival object
SurvObject <-Surv(survivalData[,1],survivalData[,2])

5.2.2 Fit univariate Cox models for each factor

All covariates are scaled to ensure comparability of hazard ratio.This, however, looses interpretability of HR for 0-1 groups.

# fit a cox model for latent factors, scaling predictors
pval_LFs<-sapply(which(apply(Z,2,var, na.rm=T)>0), function(i){
    fit <- coxph(SurvObject  ~  scale(Zcommon[,i]))  
    s <- summary(fit) 
    c(p      = s[["coefficients"]][, "Pr(>|z|)"], 
      coef   = s[["coefficients"]][, "exp(coef)"], 
      lower  = s[["conf.int"]][, "lower .95"], 
      higher = s[["conf.int"]][, "upper .95"])
  })
colnames(pval_LFs) <- colnames(Z)[which(apply(Z,2,var, na.rm=T)>0)]

# collect in data frame
df_survival <- as.data.frame(t(pval_LFs))
df_survival$label <- paste(formatC(df_survival$p, digits = 2), sep="")
df_survival$predictor <- factor(paste("Factor",colnames(pval_LFs)), levels=rev(paste("Factor",colnames(pval_LFs))))
df_survival
##               p      coef     lower    higher   label predictor
## 1  2.623478e-05 0.6499277 0.5316406 0.7945330 2.6e-05  Factor 1
## 2  2.544566e-01 1.1407150 0.9095763 1.4305899    0.25  Factor 2
## 3  1.093300e-01 0.8489863 0.6948187 1.0373608    0.11  Factor 3
## 4  3.167230e-01 0.8902660 0.7090885 1.1177358    0.32  Factor 4
## 5  1.080130e-01 0.8565113 0.7090897 1.0345822    0.11  Factor 5
## 6  9.483115e-01 1.0065045 0.8273433 1.2244632    0.95  Factor 6
## 7  5.498890e-07 1.6282506 1.3454302 1.9705221 5.5e-07  Factor 7
## 8  1.538915e-09 0.4464297 0.3436397 0.5799665 1.5e-09  Factor 8
## 9  9.438812e-02 0.8427460 0.6896557 1.0298194   0.094  Factor 9
## 10 8.245706e-01 0.9714972 0.7523196 1.2545292    0.82 Factor 10
# re-orient to HR > 1
df_survival %<>% mutate(positive = coef>1)
df_survival %<>% mutate(y = ifelse(positive,coef, 1/coef))
df_survival %<>% mutate(ymin = ifelse(positive,lower, 1/lower))
df_survival %<>% mutate(ymax = ifelse(positive,higher, 1/higher))

Forest Plot of univariate associations

# TTT
ggplot(df_survival, aes(x=predictor, y=y,ymin=ymin,ymax=ymax))+
  geom_pointrange( col='#619CFF')+ coord_flip() +
  scale_x_discrete() + ylab("(Positive) Hazard Ratio")+ 
  scale_y_log10(breaks=c(0.75,1,1.5,2,3), limits=c(min(df_survival$ymin)-0.1,3.1)) +
  geom_hline(aes(yintercept=1), linetype="dotted") + 
  geom_text(aes(label=label, y=2.5),size=5, hjust = "left")+
  theme(text =element_text(size=18),
        axis.ticks.y = element_blank(),
        legend.position="bottom", panel.grid =element_blank(),
        panel.background = element_rect(fill="white"),
        strip.text = element_blank(),
        axis.text.y = element_text(size=16),
        axis.text.x = element_text(size=16),
        plot.title = element_text(size=18)) +
  guides(colour=F) +
  xlab("Factor") + scale_color_discrete(drop=FALSE)

5.2.3 Multivariate models for predicting survival

Harrals C-Index is used as performance measure, 5-fold stratified cross-validation. Both models using PCs and no penalization as well as using all features with a ridge approach are fitted here.

  #stratified CV to include same proportion of uncensored cases
  set.seed(1290)
  uncensored <- SurvObject[,"status"]==1
  cv_ix <- dismo::kfold(1:length(commonPats), k=5, by=uncensored)
  
  # Use same number of principal components as predictors as MOFA factors (without the intercept)
  topPC <- ncol(Z) - 1

  # impute missing data values by mean
  data_imputed <- lapply(dataCommmon, function(view) {
    apply(view,1, function(x) { x[is.na(x)] <- mean(x, na.rm=TRUE); return(x)})
  })
  
  # add concatenated data matrix
  data_imputed$all <- Reduce(cbind, data_imputed)
  
  # construct predictor list fo first 'topPC' PCs each
  FeatureList <- lapply(names(data_imputed), function(singleview){
    dat <- data_imputed[[singleview]]
    pc.out <- prcomp(dat)
    pc.out$x[,1:topPC]
  })
  names(FeatureList) <- paste(names(data_imputed),sep="")
  names(FeatureList)[grep("viab",names(FeatureList))] <- "DrugResponse"
  
  # add MOFA factors
  FeatureList$LF = Zcommon[,-1]

  # List of predictors using all features
  FeatureList_all <- data_imputed
  names(FeatureList_all) <- paste(names(data_imputed),"full",sep="_")

  # fit a cox model and predict on left-out fold
  fit <- lapply(unique(cv_ix), function(i){
    
  #fit coxph for reduced (i.e. MOFA facotrs and PCs)
  c = lapply(FeatureList, function(x) coxph(SurvObject[cv_ix!=i,]  ~ as.matrix(x)[cv_ix!=i,]))
  p = mapply(function(x,y) as.matrix(x[cv_ix==i,]) %*% coef(y), FeatureList, c)
      
  # fit cox model with ridge penalty for all
  c_all = lapply(FeatureList_all, function(x) 
  cv.glmnet(as.matrix(x)[cv_ix!=i,], survivalData[cv_ix!=i,], family="cox", alpha=0))
  p_all = mapply(function(x,y) as.matrix(x[cv_ix==i,]) %*% coef(y, y$lambda.min), FeatureList_all, c_all)
  p_all <- do.call(cbind, p_all)      
  colnames(p_all) <- names(FeatureList_all)
  
  # calculate CI    on left-out fold 
  p_joint <- cbind(p, p_all)
  CI=apply(-p_joint,2, rcorr.cens, SurvObject[cv_ix==i])[1,]
  list(CI=CI,c=c, c_all=c_all)
  })

  # get cross-validated CI
  concordanceCV <- sapply(fit, function(l) l$CI)
  colnames(concordanceCV) <- unique(cv_ix)
  df_hc <- melt(concordanceCV, varnames = c("predictor", "cv_idx"), value.name = "CI")
  df_hc$predictor <- ifelse(df_hc$predictor=="LF", "MOFA factors", as.character(df_hc$predictor))

  # calculate summary statistics across folds
  summaryHc <- aggregate(df_hc$CI,
    by = list(predictors = df_hc$predictor), 
    FUN = function(x) c(mean = mean(x), sd = sd(x),
                        n = length(x)))
  summaryHc <- do.call(data.frame, summaryHc)
  summaryHc$se <- summaryHc$x.sd / sqrt(summaryHc$x.n)

  summaryHc
##          predictors    x.mean       x.sd x.n          se
## 1               all 0.7409447 0.03234065   5 0.014463178
## 2          all_full 0.7430823 0.03650672   5 0.016326299
## 3             Drugs 0.6884616 0.06257370   5 0.027983810
## 4        Drugs_full 0.7102080 0.04400252   5 0.019678524
## 5       Methylation 0.7195992 0.01573335   5 0.007036167
## 6  Methylation_full 0.6941073 0.04752903   5 0.021255629
## 7      MOFA factors 0.7808409 0.06222417   5 0.027827496
## 8              mRNA 0.7148240 0.02933528   5 0.013119135
## 9         mRNA_full 0.7203712 0.03745143   5 0.016748789
## 10        Mutations 0.6938605 0.03800959   5 0.016998407
## 11   Mutations_full 0.6918274 0.08543174   5 0.038206234

Define colors for barplot

cols4survival <- c("#E69F00","#D55E00","#009E73", "#56B4E9",  "#999999","#0072B2")
names(cols4survival) <- c("Methylation", "Mutations","mRNA",  "Drugs", "all", "MOFA factors")

Main Plot 4(a): PCs as predictors Plot the resulting prediciton performance when principal componenets are included as predictors in a multivariate Cox model.

summaryHc_main <- filter(summaryHc,
                         predictors %in% c(names(data), "all", "MOFA factors"))
limits <- aes(ymax = summaryHc_main$x.mean + summaryHc_main$se,
              ymin = summaryHc_main$x.mean - summaryHc_main$se)
summaryHc_main$predictors <- factor(summaryHc_main$predictors,
                                    levels =c("Methylation", "Mutations","mRNA",  "Drugs",
                                               "all", "MOFA factors"))

# barplot
  ggplot(summaryHc_main, aes(x=predictors, y=x.mean, fill=predictors, group=predictors))+
    geom_bar(stat = "identity") + 
   coord_cartesian(ylim = c(0.5,0.85)) +
  scale_fill_manual(values=cols4survival, labels=names(cols4survival)) + 
  geom_errorbar(limits, position = "dodge", width = 0.25)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size=18),
        axis.text.y = element_text(size=18),
         axis.title.y = element_text(size=18),
        strip.background = element_rect(fill="white"),
        strip.text = element_blank(),
        legend.position= "none",
        plot.title = element_text(size=20)) +
  ylab("Harrell's C-index") + xlab("")

Supplement S14: All features as predictors Plot the resulting prediciton performance when all features are included as predictors in a multivariate Cox model with ridge penalty.

summaryHc_supp <- filter(summaryHc,
                         predictors %in% c(paste(c(names(data), "all"), "full", sep="_"), "MOFA factors"))#, "factors 1,7,8"))
limits <- aes(ymax = summaryHc_supp$x.mean + summaryHc_supp$se,
              ymin = summaryHc_supp$x.mean - summaryHc_supp$se)
summaryHc_supp$predictors <- factor(sapply(summaryHc_supp$predictors, function(nm) sub("_full","",nm)),
                                    levels =c("Methylation", "Mutations","mRNA",  "Drugs",
                                               "all", "MOFA factors"))

# barplot
  ggplot(summaryHc_supp, aes(x=predictors, y=x.mean, fill=predictors, group=predictors))+
    geom_bar(stat = "identity") + 
   coord_cartesian(ylim = c(0.5,0.85)) +
  scale_fill_manual(values=cols4survival, labels=names(cols4survival)) + 
  geom_errorbar(limits, position = "dodge", width = 0.25)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_rect(fill="white"),
        strip.text = element_blank(),
        legend.position= "none",
        plot.title = element_text(size=20)) +
  ylab("Harrell's C-index") + xlab("")

5.2.3.1 Kaplan-Meier plots

To visualize the association of the factor to clinical outcome. Samples are split into two groups based on the continious factors and a Kaplan Meier plot is made for those groups. Note that this is mainly for visualisation purposes, usually the continious factor should be considered as above in the Cox models.

  # calculate KM for all LFs
  glist <- lapply(as.character(1:(ncol(Z)-1)), function(lfno){
  
  #determine optimal cut-point and classify samples accordingly
  df <- data.frame(time=survivalData[,1], event = survivalData[,2], Zcommon)
  cut <- survminer::surv_cutpoint(df, variables=paste("X", lfno, sep=""))
  df$FactorCluster <- Zcommon[,lfno] > cut$cutpoint$cutpoint
  
  # due to rotational invariance of factors just use same colors for upper and lower group in the KM plot 
  # (irrescpecitve of which end of the factor they belong to)
  if(lfno==7) df$FactorCluster <- Zcommon[,lfno] < cut$cutpoint$cutpoint

  # fit survival model for the given factor
  fit <- survfit(Surv(time, event) ~ FactorCluster, df)
  ggLF <- survminer::ggsurvplot(fit, data =df,
                        conf.int = TRUE,
                        pval = TRUE,
                        fun = function(y) y*100,
                        legend = "none",
                        legend.labs = c(paste("low LF", lfno), paste("high LF", lfno)),
                        xlab = "Time to treatment",
                        ylab=ifelse(lfno==1, "Survival probability (%)", ""),
                        title= paste("Factor", lfno)
                        )
  ggLF$plot
 })
 
 # make joint plot of significant factors:
    grid.arrange(glist[[1]], glist[[7]], glist[[8]], ncol=3) 

5.2.3.2 Connections to clinical covariates

Code for comparing the MOFA prediction with clinical markers used for treatment decisions can be found here

5.3 Imputation

Imputation of missing values in MOFA is done via the imputeMissing function, the imputations are stored in the imputedData slot of the MOFAobject

MOFAobject <- imputeMissing(MOFAobject)

The code for the benchmarking of the imputation performance by masking varying amount of values or entire assays at random can be found here.

6 SessionInfo

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_IE.UTF-8/en_IE.UTF-8/en_IE.UTF-8/C/en_IE.UTF-8/en_IE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2                   MultiAssayExperiment_1.4.9      
##  [3] BloodCancerMultiOmics2017_0.99.6 Hmisc_4.1-1                     
##  [5] Formula_1.2-2                    lattice_0.20-35                 
##  [7] survival_2.41-3                  glmnet_2.0-16                   
##  [9] foreach_1.4.4                    Matrix_1.2-14                   
## [11] tidyr_0.8.0                      RColorBrewer_1.1-2              
## [13] ggbeeswarm_0.6.0                 pheatmap_1.0.8                  
## [15] ClusterR_1.1.1                   gtools_3.5.0                    
## [17] ggpubr_0.1.6                     beeswarm_0.2.3                  
## [19] cowplot_0.9.2                    magrittr_1.5                    
## [21] GGally_1.3.2                     ggplot2_2.2.1                   
## [23] gridExtra_2.3                    reshape2_1.4.3                  
## [25] dplyr_0.7.4                      MOFAtools_0.1                   
## [27] BiocStyle_2.6.1                 
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.2            corrplot_0.84             
##   [3] plyr_1.8.4                 lazyeval_0.2.1            
##   [5] sp_1.2-7                   shinydashboard_0.7.0      
##   [7] splines_3.4.4              gmp_0.5-13.1              
##   [9] BiocParallel_1.12.0        GenomeInfoDb_1.14.0       
##  [11] digest_0.6.15              htmltools_0.3.6           
##  [13] tiff_0.1-5                 checkmate_1.8.5           
##  [15] memoise_1.1.0              cluster_2.0.7-1           
##  [17] doParallel_1.0.11          annotate_1.56.2           
##  [19] matrixStats_0.53.1         jpeg_0.1-8                
##  [21] colorspace_1.3-2           blob_1.1.1                
##  [23] ggrepel_0.7.3              xfun_0.1                  
##  [25] RCurl_1.95-4.10            genefilter_1.60.0         
##  [27] bindr_0.1.1                zoo_1.8-1                 
##  [29] iterators_1.0.9            ape_5.1                   
##  [31] glue_1.2.0                 survminer_0.4.2           
##  [33] gtable_0.2.0               zlibbioc_1.24.0           
##  [35] XVector_0.18.0             DelayedArray_0.4.1        
##  [37] BiocGenerics_0.24.0        abind_1.4-5               
##  [39] scales_0.5.0               mvtnorm_1.0-7             
##  [41] DBI_0.8                    Rcpp_0.12.16              
##  [43] cmprsk_2.2-7               xtable_1.8-2              
##  [45] htmlTable_1.11.2           magic_1.5-8               
##  [47] foreign_0.8-69             bit_1.1-12                
##  [49] km.ci_0.5-2                ipflasso_0.1              
##  [51] stats4_3.4.4               dismo_1.1-4               
##  [53] htmlwidgets_1.0            acepack_1.4.1             
##  [55] pkgconfig_2.0.1            reshape_0.8.7             
##  [57] XML_3.98-1.10              nnet_7.3-12               
##  [59] locfit_1.5-9.1             tidyselect_0.2.4          
##  [61] labeling_0.3               rlang_0.2.0               
##  [63] AnnotationDbi_1.40.0       munsell_0.4.3             
##  [65] OpenImageR_1.0.8           tools_3.4.4               
##  [67] RSQLite_2.1.0              ade4_1.7-11               
##  [69] broom_0.4.4                devtools_1.13.5           
##  [71] evaluate_0.10.1            geometry_0.3-6            
##  [73] stringr_1.3.0              FD_1.0-12                 
##  [75] ggdendro_0.1-20            yaml_2.1.18               
##  [77] knitr_1.20                 bit64_0.9-7               
##  [79] survMisc_0.5.4             purrr_0.2.4               
##  [81] nlme_3.1-137               mime_0.5                  
##  [83] compiler_3.4.4             rstudioapi_0.7            
##  [85] png_0.1-7                  tibble_1.4.2              
##  [87] geneplotter_1.56.0         stringi_1.1.7             
##  [89] psych_1.8.3.3              KMsurv_0.1-5              
##  [91] vegan_2.5-1                permute_0.9-4             
##  [93] pillar_1.2.1               data.table_1.10.4-3       
##  [95] bitops_1.0-6               raster_2.6-7              
##  [97] httpuv_1.3.6.2             GenomicRanges_1.30.3      
##  [99] R6_2.2.2                   latticeExtra_0.6-28       
## [101] bookdown_0.7               vipor_0.4.5               
## [103] IRanges_2.12.0             codetools_0.2-15          
## [105] exactRankTests_0.8-29      MASS_7.3-49               
## [107] assertthat_0.2.0           rhdf5_2.22.0              
## [109] SummarizedExperiment_1.8.1 DESeq2_1.18.1             
## [111] rprojroot_1.3-2            withr_2.1.2               
## [113] mnormt_1.5-5               S4Vectors_0.16.0          
## [115] GenomeInfoDbData_1.0.0     mgcv_1.8-23               
## [117] parallel_3.4.4             grid_3.4.4                
## [119] rpart_4.1-13               rmarkdown_1.9             
## [121] maxstat_0.7-25             Biobase_2.38.0            
## [123] shiny_1.0.5                base64enc_0.1-3